Multimodal Live Streaming
Vision Language Model
Video Foundation Model
Large Language Model
Visual Document Understanding
OCR
Gemini
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
https://github.com/OpenBMB/MiniCPM-o